Today's Menu¶
- Audio Signal Processing
- Theory
- A practical example
- Your next tasks
- Presentations
Are we complete?¶
- Team Task 1 (Ioan-Cristian, Dominik, Michael, Anna)?
- Team Task 4 (Aleksandr, Fabian, Lara, Maximilian)?
- Team Task 6 (Giovanni, Fabian, Youniss)?
- Team AES: QVIM (Andrei, Anton Atanasov, Aleksandar)?
Tips & Tricks for building an ML pipeline [Wednesday, April 2]¶
Pipeline - Expectation¶
Pipeline - Reality¶
Best Practices for ML Pipelines¶
Document everything
Your future self (and your teammates) should understand what you did, why you did it, and how it works.
Double-check your steps
Small mistakes (e.g., data leakage, wrong labels, off-by-one errors) can silently propagate and waste hours later.
Log everything
Track parameters, results, errors, and unexpected behaviors. Good logs = easy debugging.
Use version control
Commit early, commit often. Track your code and configs to understand changes over time.
Automate where possible
Make your pipeline reproducible. E.g., use scripts to run multiple commands.
Keep it simple (at first)
Start with a minimal working version, then iterate. Don’t over-engineer early.
Documentation¶
Maintain a continuously updated work log
Your colleagues (and future you) should be able to quickly understand what you've done and why.
A good work log saves time when writing your technical report
You’ll thank yourself later when everything is already written down.
Store your work log in version control
Or at least use a shared document if collaborating — keep everything accessible and trackable.
Collect questions for the teaching team in your work log
This helps identify common issues and makes discussions more efficient.
Add comments to your code, focusing on why things are done
Don't just write what the code does — explain the reasoning behind key decisions.
Version Control¶
Use Git — it's the standard tool for version control
Learn the basics well (commits, branches, merges, resolving conflicts).
When collaborating, work on your own branches
This avoids conflicts and allows for parallel development.
Merge frequently
Don’t let branches drift too far apart — regular merges help catch issues early.
Keep the main/master branch stable and runnable
Treat it as your working baseline; don’t break it.
Review new features as a group before merging
Ensure everyone is on the same page and avoid unexpected issues in shared code.
Testing & Sanity Checks¶
Run small, regular tests to avoid wasting time debugging later:
Exploratory tests
Try out unfamiliar libraries or functions in a Jupyter Notebook to quickly learn how they behave.
Sanity checks for your own code
Test functions like data loading, augmentation, or feature extraction on a few samples. Print shapes, min/max, or visualize outputs.
Visual inspection
Plot spectrograms, embeddings, or model outputs to ensure your pipeline behaves as expected.
Overfit a tiny batch
A classic trick: your model should be able to overfit 1–2 training examples. If it can’t, something’s likely wrong.
Group review of key changes
Before running full experiments, verify new code together — four eyes catch more bugs than two.
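The "overfit a tiny batch" check can be sketched as follows. This is a minimal illustration with a toy linear classifier on random features (the dimensions and hyperparameters are placeholders, not from any real task): if the loss on two fixed examples does not approach zero, something in the pipeline is likely broken.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Two fixed training examples (hypothetical 128-dim features, 10 classes)
x = torch.randn(2, 128)
y = torch.tensor([3, 7])

model = torch.nn.Linear(128, 10)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)

# Repeatedly fit the same tiny batch; the loss should shrink towards zero
for step in range(500):
    optimizer.zero_grad()
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    optimizer.step()

final_loss = loss.item()
```

If `final_loss` stays high, check the data loading, the labels, and the loss computation before scaling up.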
Reproducibility: What Can Break It?¶
Code & Dependencies
- Changing library versions (e.g., PyTorch, NumPy) can change behavior
- Solution: use virtual environments (e.g., conda, venv) and freeze dependencies (requirements.txt or environment.yml)
Training Pipeline
- Pseudo-random number generators (PRNGs): Python, NumPy, PyTorch, etc.
- Parallelism (e.g., multi-GPU, DataLoader workers) can introduce variability
- Non-deterministic GPU ops (e.g., certain cuDNN kernels, FP16 ops)
Data Preprocessing
- Augmentations, shuffling, random splits, and label generation must be deterministic
- Even slight differences (e.g., rounding, file order) can change results
Reproducibility: How to make deterministic?¶
import random

import numpy as np
import torch
import pytorch_lightning as pl
from torch.utils.data import DataLoader

# Set all PRNGs
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed_all(SEED)

# Enforce deterministic ops (optional, slows things down)
torch.use_deterministic_algorithms(True)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

pl.seed_everything(SEED, workers=True)  # Let Lightning seed everything, including workers

loader = DataLoader(
    dataset,
    shuffle=True,
    generator=torch.Generator().manual_seed(SEED),  # Reproducible DataLoader shuffling
    ...
)

# Log everything needed to rerun the experiment
log = {"seed": SEED, "git_commit": ..., "torch_version": torch.__version__, ...}
Data Leakage¶
- Always use the provided split for your task
- Ensure that no information from the test set leaks into training or validation. This includes label distribution, normalization stats, data augmentations, etc.
- Normalization leakage is a common pitfall: Don’t compute mean/std over the full dataset — only use training data stats.
- Best practice: Treat your test set as untouchable — only access it once, at the very end, for final evaluation.
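The normalization rule above can be sketched with synthetic arrays (shapes and values are purely illustrative): the mean and standard deviation are computed from the training split only and then applied, unchanged, to the test split.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature matrices, for illustration only
train = rng.normal(loc=2.0, scale=3.0, size=(1000, 16))
test = rng.normal(loc=2.0, scale=3.0, size=(200, 16))

# Correct: statistics come from the training split only
mean, std = train.mean(axis=0), train.std(axis=0)
train_norm = (train - mean) / std
test_norm = (test - mean) / std  # test is transformed, never "fitted"

# Leaky (do NOT do this): statistics computed over train + test
leaky_mean = np.concatenate([train, test]).mean(axis=0)
```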
Leakage across devices¶
- In DCASE Task 1, recording devices are a major source of domain shift.
- Measuring generalization performance to unseen devices is crucial.
- Splitting data without considering the recording device causes leakage and may inflate your results. Use the provided split, which accounts for device separation.
PyTorch Lightning: Minimal Interface¶
import torch
import torch.nn.functional as F
import pytorch_lightning as pl


class MyModel(pl.LightningModule):
    def __init__(self, n_classes=10):
        super().__init__()
        self.model = torch.nn.Linear(128, n_classes)  # example model
        self.validation_step_outputs = []

    def forward(self, x):
        return self.model(x)  # inference step

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = F.cross_entropy(self(x), y)
        return loss

    def validation_step(self, batch, batch_idx):
        x, y = batch
        val_loss = F.cross_entropy(self(x), y)
        self.validation_step_outputs.append(...)
        return val_loss

    def on_validation_epoch_end(self):
        self.log("val/accuracy", ...)
        self.validation_step_outputs.clear()

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)


if __name__ == "__main__":
    model = MyModel(n_classes=10)
    train_loader, val_loader = ...  # define DataLoaders
    trainer = pl.Trainer(max_epochs=10)
    trainer.fit(model, train_loader, val_loader)
Weights and Biases: Minimal Interface¶
import argparse
import subprocess

import torch
import pytorch_lightning as pl
from pytorch_lightning.loggers import WandbLogger


def main(config):
    wandb_logger = WandbLogger(
        project="my-project",
        name=config.experiment_name,
        config=vars(config),  # logs all argparse args as W&B config
    )
    trainer = pl.Trainer(
        max_epochs=config.n_epochs,
        logger=wandb_logger,
    )
    ...


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--lr", type=float, default=1e-3)
    parser.add_argument("--n_epochs", type=int, default=10)
    parser.add_argument("--experiment_name", type=str, default="minimal-wandb-run")
    args = parser.parse_args()

    try:
        commit_hash = subprocess.check_output(["git", "rev-parse", "HEAD"]).decode("ascii").strip()
    except Exception:
        commit_hash = "unknown"
    args.commit_hash = commit_hash  # include in logged config

    # save library versions for reproducibility
    args.versions = {
        "torch": torch.__version__,
        ...
    }
    main(args)
Audio Signal Processing¶
Audio Signal Processing¶
Disclaimer:¶
The following section is a practical introduction to audio signal processing for deep learning. It is meant to help you:
- Understand common steps and hyperparameters used in audio pipelines
- Interpret preprocessing choices in research papers and codebases
What this is not:
- A deep dive into the mathematical theory behind signal processing
- A substitute for a full course on digital signal processing (DSP)
If you're curious to go deeper into the math, we highly recommend
📘 The Scientist and Engineer's Guide to Digital Signal Processing
Overview – Theory¶
Let's do a speedrun across the theoretical foundations of a typical audio signal processing pipeline:
- Sound and Digital Audio Signals: What is sound, and how do we convert it into a digital signal?
- Digital Filters: Tools to process the digital signal in time or frequency domain
- Discrete Fourier Transform (DFT): Analyze a digital signal in terms of its frequency components
- Magnitude Spectrum: Convert the complex spectrum into a magnitude spectrum
- Frequency Resolution: Understand the frequency spacing and resolution of the spectrum
- Spectral Leakage and Windowing: Use window functions to reduce spectral leakage
- Short-Time Fourier Transform (STFT): Convert a time-domain signal into a time–frequency representation
- Mel Transform: Compress the frequency axis based on human auditory perception
- Logarithmic Compression of Amplitude: Compress the dynamic range → log mel spectrogram
An example¶
Our starting point: ... a dog barking
sr = 32000
example_wav, _ = liro.load(example_file, sr=sr)
wav_plot(example_wav, sr)
Long Story Short¶
from torchaudio import transforms
waveform, sample_rate = torchaudio.load(example_file)
transform = transforms.MelSpectrogram(sample_rate, n_fft=800, n_mels=80)
mel_specgram = transform(waveform)
mel_specgram = torch.log(mel_specgram + 1e-5)
plot_spectrogram(mel_specgram, waveform, sample_rate)
Sound and Digital Audio Signals¶
- Sound: variation in air pressure at a point in space as a function of time
- The microphone turns the mechanical energy of a soundwave into an analog electrical signal
- To process it with a computer, we convert it into a digital signal, which involves:
- Any ideas?
Sound and Digital Audio Signals¶
- Sound: variation in air pressure at a point in space as a function of time
- The microphone turns the mechanical energy of a soundwave into an analog electrical signal
- To process it with a computer, we convert it into a digital signal, which involves:
- Sampling: measuring the signal at regular time intervals (sampling rate, e.g., 44,100 times per second)
- Quantization: rounding each sample to a fixed set of amplitude levels (bit depth, e.g., 16-bit resolution)
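The two steps can be sketched in a few lines. This is an idealized illustration (the "analog" signal is itself simulated in NumPy): sampling picks values at intervals of 1/sr seconds, and quantization rounds each one to a 16-bit integer level.

```python
import numpy as np

sr = 44_100          # sampling rate: samples per second
bit_depth = 16       # quantization: bits per sample
duration = 0.01      # seconds

# Sampling: measure a 440 Hz sine at regular intervals of 1/sr seconds
t = np.arange(int(sr * duration)) / sr
analog = np.sin(2 * np.pi * 440 * t)  # stand-in for the analog signal

# Quantization: round each sample to one of 2**16 levels in [-1, 1]
scale = 2 ** (bit_depth - 1) - 1      # 32767 for 16-bit
digital = np.round(analog * scale).astype(np.int16)

# The quantization error per sample is at most half a level
max_error = np.abs(digital / scale - analog).max()
```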
The Sampling Theorem¶
If the signal is "properly sampled", it can be reconstructed exactly from its samples
A continuous (analog) signal is "properly sampled" if it contains no frequency components above half the sampling rate
- This limit is called the Nyquist frequency
- Example: for a sampling rate of 44,100 Hz → Nyquist frequency = 22,050 Hz
Frequencies above the Nyquist limit will be aliased — incorrectly folded into lower frequencies: → results in distortion and irrecoverable information loss
To prevent aliasing: Apply an analog low-pass filter before digitizing the signal
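Aliasing can be demonstrated directly (sampling rate and frequencies here are chosen only for illustration): at a 1000 Hz sampling rate, a 900 Hz tone lies above the 500 Hz Nyquist limit, and its samples are indistinguishable from those of a 100 Hz tone.

```python
import numpy as np

sr = 1000                      # sampling rate in Hz -> Nyquist = 500 Hz
n = np.arange(64)

# A 900 Hz cosine is above the Nyquist limit (500 Hz) ...
above_nyquist = np.cos(2 * np.pi * 900 * n / sr)

# ... and its samples are identical to those of a 100 Hz cosine:
# 900 Hz folds down to |900 - 1000| = 100 Hz
aliased = np.cos(2 * np.pi * 100 * n / sr)

samples_match = np.allclose(above_nyquist, aliased)
```

Once sampled, the two signals cannot be told apart, which is why the low-pass filter must be applied in the analog domain, before digitization.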
The Sampling Theorem¶
Digital Filters¶
Digital filters are essential tools in audio processing
→ used to enhance, suppress, or separate signal components
Can operate in the time or frequency domain
Defined by their impulse response or frequency response
Two main types:
- FIR (Finite Impulse Response) → implemented via convolution
- IIR (Infinite Impulse Response) → uses recursion (feedback)
See the pre-emphasis filter in the example pipeline
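The two filter types can be contrasted in a small sketch (the taps and feedback coefficient are arbitrary illustrative values): an FIR filter is a convolution with a finite set of taps, while an IIR filter feeds its own past outputs back into the computation.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)                 # input signal

# FIR: convolution with a finite impulse response (5-point moving average)
fir_taps = np.ones(5) / 5
y_fir = np.convolve(x, fir_taps, mode="full")[:len(x)]

# IIR: output depends on previous outputs (a one-pole recursive smoother)
a = 0.9
y_iir = np.zeros_like(x)
for n in range(len(x)):
    prev = y_iir[n - 1] if n > 0 else 0.0
    y_iir[n] = (1 - a) * x[n] + a * prev  # feedback term makes it "infinite"
```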
Discrete Fourier Transform (DFT)¶
- The DFT can be used to analyze digital signals in terms of their frequency components
- It assumes the signal is finite, periodic, and repeats infinitely in both directions
- The result is a set of complex coefficients, each representing a sinusoidal basis function:
$$ X[k] = \frac{1}{N} \sum_{n=0}^{N-1} x[n] \cdot e^{\frac{-2\pi i k n}{N}} $$
- $X[k]$: complex amplitude of the $k$-th frequency bin
- $N$: number of time samples
- The exponent represents a complex sinusoid (a rotating vector)
Discrete Fourier Transform (DFT)¶
Using Euler’s formula:
$$ e^{ix} = \cos(x) + i \sin(x) $$
we can write:
$$ X[k] = \frac{1}{N} \sum_{n=0}^{N-1} x[n] \cdot \cos\left(\frac{-2\pi k n}{N}\right) + i \cdot x[n] \cdot \sin\left(\frac{-2\pi k n}{N}\right) $$
- The real part corresponds to how much of a cosine wave (frequency $k$) is present in the signal
- The imaginary part corresponds to how much of a sine wave (same frequency $k$) is present
$\rightarrow$ The DFT tells us how much of each sine and cosine wave at frequency $k$ is needed to reconstruct the signal.
Discrete Fourier Transform (DFT)¶
- The time-domain signal consists of $N$ samples: $x[0] \dots x[N-1]$
- The DFT translates this into $\frac{N}{2} + 1$ sine (imaginary) and cosine (real) amplitudes
- The DFT can be computed via:
- Matrix multiplication with sine/cosine basis → $O(N^2)$
- Or much faster using the Fast Fourier Transform (FFT) → $O(N \log N)$
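The equivalence of the two computations can be checked numerically. A sketch, using the slide's $\frac{1}{N}$ normalization (note that NumPy's FFT omits this factor, so we divide by $N$ for comparison):

```python
import numpy as np

N = 64
rng = np.random.default_rng(0)
x = rng.normal(size=N)

# DFT as a matrix multiplication with complex sinusoid basis vectors: O(N^2)
n = np.arange(N)
k = n.reshape(-1, 1)
dft_matrix = np.exp(-2j * np.pi * k * n / N) / N  # 1/N as in the slide's formula
X_naive = dft_matrix @ x

# Same result via the FFT: O(N log N); NumPy's convention has no 1/N factor
X_fft = np.fft.fft(x) / N

results_match = np.allclose(X_naive, X_fft)
```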
Magnitude Spectrum¶
Each DFT coefficient $X[k]$ is a complex number with a real part (cosine component) and an imaginary part (sine component)
These can be converted to polar form:
- Magnitude: $ |X[k]| = \sqrt{(\mathrm{Re}\,X[k])^2 + (\mathrm{Im}\,X[k])^2} $
- Phase (angle): $ \phi[k] = \arctan2(\mathrm{Im}\,X[k], \mathrm{Re}\,X[k]) $
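The conversion to polar form is a one-liner per quantity; the sketch below also verifies that magnitude and phase together reconstruct the original complex coefficients (the input signal is arbitrary random data):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=128)
X = np.fft.rfft(x)                  # complex spectrum, N/2 + 1 bins

magnitude = np.sqrt(X.real ** 2 + X.imag ** 2)   # |X[k]|
phase = np.arctan2(X.imag, X.real)               # phi[k]

# Polar form reconstructs the original complex coefficients exactly
reconstructed = magnitude * np.exp(1j * phase)
polar_ok = np.allclose(reconstructed, X)
```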
Magnitude Spectrum¶
In many applications:
- We use only the magnitude spectrum — or more often, the power spectrum (squared magnitude)
- The phase is often ignored — the human ear is mostly insensitive to phase, except in some edge cases (e.g., localization, transients)
The magnitude of bin $X[k]$ reflects how much energy is present in its frequency band
Frequency Resolution¶
- An $N$-point FFT (typically with $N$ as a power of 2) produces $\frac{N}{2} + 1$ frequency bins for real-valued input
- These bins are uniformly spaced from $0$ to $\frac{S}{2}$ Hz, where $S$ is the sampling rate
- The frequency spacing between bins is: $\Delta f \approx \frac{S}{N}$ → This is often called the frequency resolution
Example:¶
- Sampling rate: $S = 32,000$ Hz
- FFT size: $N = 1024$
$$\Delta f \approx \frac{32000}{1024} \approx 31.25\ \text{Hz}$$
→ You get 513 bins, each representing a frequency band ~31.25 Hz wide, from 0 Hz up to 16,000 Hz
Frequency Resolution¶
- To get more finely spaced bins in the frequency domain:
- You can increase $N$ by zero-padding the signal (appending zeros)
- This improves visual resolution, but not true spectral resolution
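The numbers from the example can be reproduced directly; the sketch also shows zero-padding, which yields more (interpolated) bins without adding information:

```python
import numpy as np

sr, N = 32_000, 1024
delta_f = sr / N                 # bin spacing: 31.25 Hz
n_bins = N // 2 + 1              # 513 bins for real-valued input

x = np.sin(2 * np.pi * 1000 * np.arange(N) / sr)

# Zero-padding to 4N gives 4x more, finer-spaced bins ...
X_padded = np.fft.rfft(x, n=4 * N)
# ... but the true spectral resolution is still set by the original
# signal length, not by the padding
```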
Spectral Leakage and Windowing¶
- The DFT assumes the signal is periodic over the analysis window
→ It treats the time-domain signal as if it's infinitely repeated
- If a sinusoid does not complete an integer number of cycles within the window:
- It gets cut off at the edges
- This introduces discontinuities at the window boundaries
Spectral Leakage and Windowing¶
- These sharp edges cause spectral leakage:
- Energy from a single frequency spreads into many DFT bins
- This effect requires many basis functions to explain the edge artifacts
- To reduce discontinuities at the window edges, we multiply the signal with a window function
→ This smoothly tapers the signal to zero at the edges
→ Prevents sharp "cuts" that cause spectral leakage
Spectral Leakage and Windowing¶
With proper windowing, sine waves that don’t perfectly align with DFT bins produce cleaner, more localized peaks
Common window functions:
- Hann, Hamming, Blackman, Gaussian, etc.
There’s a trade-off between narrow peak and low side lobes
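The effect can be measured numerically. In this sketch (window length and frequency chosen purely for illustration), a sinusoid completes 10.5 cycles per window, so it falls between DFT bins; we compare how much energy leaks far from the peak with and without a Hann window:

```python
import numpy as np

N = 256
n = np.arange(N)
# A sinusoid that does NOT complete an integer number of cycles (10.5 per window)
x = np.sin(2 * np.pi * 10.5 * n / N)

rect_spec = np.abs(np.fft.rfft(x))                  # no window (rectangular)
hann_spec = np.abs(np.fft.rfft(x * np.hanning(N)))  # Hann-windowed

# Leakage far away from the true frequency (bins 30 and above),
# relative to the peak magnitude
rect_leak = rect_spec[30:].max() / rect_spec.max()
hann_leak = hann_spec[30:].max() / hann_spec.max()
# The Hann window suppresses far-off leakage by orders of magnitude
```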
Short-Time Fourier Transform (STFT)¶
- A regular DFT tells us nothing about when things happen
- The Short-Time Fourier Transform (STFT) computes a time–frequency representation (spectrogram) by:
- Splitting the signal into short, overlapping windows
- Computing the DFT separately for each window
Short-Time Fourier Transform (STFT)¶
- Choose your window length wisely:
- Long window → better frequency resolution, worse time resolution
- Short window → better time resolution, worse frequency resolution
- Use a length where the signal is approximately stationary within a window
Mel Transform¶
After computing the STFT, we usually apply a Mel filterbank to transform the linear frequency spectrogram into a perceptually motivated, compressed representation
Motivation from human perception: The human ear does not perceive frequency linearly. We are more sensitive to changes in low frequencies than in high frequencies.
This can be implemented as a matrix multiplication:
$$ \mathrm{mel\_spec} = \mathrm{torch.matmul}(\mathrm{mel\_filterbank}, \mathrm{spectrogram}) $$
Purpose:
- Reduce dimensionality of the spectrogram
- Emphasize features most relevant to human hearing
Mel Filterbank¶
Logarithmic Compression of Amplitude¶
The human ear does not perceive loudness linearly:
- A 10× increase in sound power results in approximately a 2× increase in perceived loudness
- This follows a power-law relationship: $ \text{Loudness} \propto \text{Power}^n $, with $ n \approx 0.3 $
- To handle the wide dynamic range of real-world sounds, we use a logarithmic scale — the decibel (dB) scale — which aligns well with human perception
To convert sound power $P$ to decibels: $$ \text{dB} = 10 \cdot \log_{10}\left(\frac{P}{P_{\text{ref}}}\right) $$
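The formula can be checked with a couple of worked values (the helper function name is our own, not from any library): a 10x power increase is +10 dB, 100x is +20 dB, and halving the power loses about 3 dB.

```python
import math

def power_to_db(power, ref=1.0):
    """Convert a power ratio to decibels: 10 * log10(P / P_ref)."""
    return 10.0 * math.log10(power / ref)

step_10x = power_to_db(10.0)    # 10.0 dB
step_100x = power_to_db(100.0)  # 20.0 dB
half = power_to_db(0.5)         # about -3.01 dB
```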
Overview - Practical Example¶
Let's look at a typical preprocessing pipeline (see example pipeline on GitHub):
- Pre-emphasis filter: a simple FIR filter applied in the time domain to amplify high frequencies and flatten the spectrum
- Short-Time Fourier Transform (STFT)
- Power spectrogram: compute the squared magnitude of the complex STFT
- Mel Transform
- Logarithmic Amplitude compression → log mel spectrogram
An Example: from the waveform to the log mel spectrogram¶
Our starting point: ... a dog barking
sr = 32000
example_wav, _ = liro.load(example_file, sr=sr)
wav_plot(example_wav, sr)
Apply Pre-emphasis (Digital Filter)¶
In natural audio signals, low frequencies tend to dominate, with energy typically dropping ~2 dB per kHz
This spectral imbalance can mask important details in higher frequencies
A pre-emphasis filter is a simple FIR (Finite Impulse Response) filter applied in the time domain to:
- Flatten the spectral envelope
- Boost higher frequencies
Apply Pre-emphasis (Digital Filter)¶
preemphasis_coefficient = torch.as_tensor([[[-.97, 1]]])
wav_torch = torch.from_numpy(example_wav)
wav_pree = nn.functional.conv1d(wav_torch.reshape(1, 1, -1), preemphasis_coefficient).squeeze(1)
freq_plot(preemphasis_coefficient.squeeze().numpy(), sr, title="Pre-emphasis Filter Frequency Magnitude Response")
spec_liro(example_wav, sr, title="Log Spectrogram (without Pre-emphasis)")
spec_liro(wav_pree.squeeze().numpy(), sr, title="Log Spectrogram (with Pre-emphasis)")
STFT¶
n_fft, win_length, hop_length = 1024, 800, 320
window = torch.hann_window(win_length)
spec = torch.stft(wav_pree, n_fft=n_fft, hop_length=hop_length,
win_length=win_length, window=window,
return_complex=True)
print("Complex spec shape: ", spec.shape)
spec = torch.view_as_real(spec)
print("Real spec shape: ", spec.shape)
power_spec = (spec ** 2).sum(dim=-1)
# for comparison, we calculate also the magnitude spectrogram
mag_spec = torch.sqrt(power_spec)
Complex spec shape: torch.Size([1, 513, 1000]) Real spec shape: torch.Size([1, 513, 1000, 2])
STFT Window¶
wav_plot(window.squeeze().numpy(), sr, listen=False, title="Hann window (time-domain)")
spec_liro(mag_spec.squeeze().numpy(), sr, x_is_spec=True, convert_to_db=False, title="Magnitude Spectrogram")
spec_liro(power_spec.squeeze().numpy(), sr, x_is_power_spec=True, convert_to_db=False, title="Power Spectrogram")
Mel Transformation¶
n_mels, fmin, fmax = 40, 0.0, sr // 2
mel_basis, _ = torchaudio.compliance.kaldi.get_mel_banks(n_mels, n_fft, sr,
fmin, fmax,
vtln_low=100.0,
vtln_high=-500.,
vtln_warp_factor=1.0)
# pad with one zero per mel bin to match n_fft // 2 + 1
mel_basis = torch.as_tensor(torch.nn.functional.pad(
mel_basis, (0, 1), mode='constant', value=0)
)
print(mel_basis.shape)
torch.Size([40, 513])
fig, ax = plt.subplots(nrows=1, figsize=(10, 4))
ax.set_title("Mel filterbank")
ax.set_xlabel("FFT bin index")
ax.set_ylabel("Mel bin")
ax.imshow(mel_basis.squeeze().numpy(), cmap='hot', interpolation='nearest', aspect='auto')
plt.show()
melspec = torch.matmul(mel_basis, power_spec)
spec_liro(power_spec.squeeze().numpy(), sr, x_is_power_spec=True, title="Log Power Spectrogram")
spec_liro(melspec.squeeze().numpy(), sr, x_is_mel_spec=True, title="Log Mel Spectrogram")
Log the Amplitude¶
log_mel_spec = (melspec + 0.00001).log()
spec_liro(melspec.squeeze().numpy(), sr, x_is_mel_spec=True, convert_to_db=False, title="Mel Spectrogram")
spec_liro(log_mel_spec.squeeze().numpy(), sr, x_is_mel_spec=True, convert_to_db=False, title="Log Mel Spectrogram")
What to do with a log mel spectrogram?¶
Use your favorite vision architecture and treat the log mel spectrogram as an image with a single input channel.
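A minimal sketch of this idea (the architecture is an arbitrary toy example, not a recommended model): a small CNN whose first convolution takes one input channel, so a batch of log mel spectrograms of shape (batch, 1, n_mels, time) goes in and class logits come out. Adaptive pooling makes it independent of the time dimension.

```python
import torch
import torch.nn as nn

class TinyAudioCNN(nn.Module):
    """Toy CNN treating a log mel spectrogram as a single-channel image."""

    def __init__(self, n_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),   # 1 input channel
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),   # fixed-size output for any input length
        )
        self.classifier = nn.Linear(32, n_classes)

    def forward(self, x):              # x: (batch, 1, n_mels, time)
        h = self.features(x).flatten(1)
        return self.classifier(h)

model = TinyAudioCNN(n_classes=10)
logits = model(torch.randn(4, 1, 40, 1000))  # e.g., 40 mel bins, 1000 frames
```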
Example Pipeline on GitHub¶
The example ML4Audio pipeline demonstrates the following points based on 200 wav files:
- Dataset loading, PyTorch Dataset class, PyTorch Dataloader
- Audio Signal processing routine that we discussed today
- How to use a PyTorch Model (CNN) to generate predictions based on a log mel spectrogram
- Simple data augmentation techniques (masking time frames, masking frequency bands, mixup, time rolling)
- A training loop implemented with PyTorch Lightning
- Logging implemented with Weights and Biases
- Some of the best practices we discussed
Your next tasks¶
Until 02.04.25 (one week)¶
Prepare a short presentation (10 minutes) introducing the baseline system for your task.
- The baseline code will be available by the end of this week or early next week for Tasks 1, 6, AES QVIM.
- For Task 4, the baseline may be released sometime next week. If your baseline is not yet available, prepare your presentation based on a relevant related system, and include the same key aspects listed on the next slides — as if it were your baseline.
Until 02.04.25 (one week)¶
- Aspects to look into:
- What datasets is the baseline trained on?
- What kind of input representation is used (e.g., log-mel spectrogram, waveform)?
- Are there any data augmentation techniques (e.g., mixup, SpecAugment, noise injection)?
- What neural network architecture is used?
- What loss function is used?
- What evaluation metric is used?
- On what validation/test split is performance reported?
- How long does it take to reproduce the baseline system?
- Are there any class imbalances or domain shifts that the baseline does (or does not) handle?
- What optimizer and learning rate schedule?
- What are the key hyperparameters (batch size, learning rate, etc.)?
- How far is it from the SOTA?
Until 09.04.25 (two weeks)¶
- Prepare a presentation (10 minutes).
- Successfully reproduce the baseline results and show us the logged metrics.
- Let us know soon if you encounter any problems.
- Tell us about your plans for improving the baseline over the Easter break.